python [lxml] - cleaning out html tags
Posted
by sadhu_
on Stack Overflow
See other posts from Stack Overflow
or by sadhu_
Published on 2010-06-01T13:28:34Z
Indexed on
2010/06/01
13:43 UTC
Read the original article
Hit count: 191
from lxml.html.clean import clean_html, Cleaner
def clean(text):
try:
cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True,
remove_tags = ['a', 'li', 'td'])
print (len(cleaner.clean_html(text))- len(text))
return cleaner.clean_html(text)
except:
print 'Error in clean_html'
print sys.exc_info()
return text
I put together the above (ugly) code as my initial forays into python land. I'm trying to use lxml cleaner to clean out a couple of html pages, so in the end i am just left with the text and nothing else - but try as i might, the above doesnt appear to work as such, i'm still left with a substial amount of markup (and it doesnt appear to be broken html), and particularly links, which aren't getting removed, despite the args i use in remove_tags
and links=True
any idea whats going on, perhaps im barking up the wrong tree with lxml ? i thought this was the way to go with html parsing in python?
© Stack Overflow or respective owner